This guide provides an overview of what a Differential Expression Analysis report created by the BigMind team contains. For the integrated and automated analysis, we utilized functions from our custom R package, OmicsKit.

1. Preprocessing: Counts per Sample

Visualizing Samples Data

Sample metadata is crucial for connecting bioinformatics results to biological significance. The first step in our analysis is to clean the data and prepare it for downstream library requirements.

ID Test Condition Group Sex Age RIN
1 Undetected Normal A F 65 6.0
2 Undetected Normal A F 42 6.0
3 Undetected Normal A M 63 5.5
4 Undetected Normal A M 49 5.8
5 Undetected Disease B M 51 5.2
6 Undetected Disease B M 53 6.1
7 Detected Normal C F 31 5.5
8 Detected Normal C F 54 5.5
9 Detected Normal C M 43 5.3
10 Detected Normal C M 67 5.7
11 Detected Normal C M 59 6.1
12 Detected Normal C M 46 5.2
13 Detected Normal C M 40 5.5
14 Detected Disease D F 70 5.6
15 Detected Disease D F 36 5.9
16 Detected Disease D F 49 6.0
17 Detected Disease D F 64 6.6
18 Detected Disease D F 66 5.6
19 Detected Disease D M 57 5.4
20 Detected Disease D M 51 6.4
21 Detected Disease D M 48 6.7

Managing Count Files

Identifying samples with abnormal counts can highlight outliers that may need to be excluded from analysis to avoid skewed results.

## reading in files with read_tsv
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
## summarizing abundance
## summarizing counts
## summarizing length

0 5,000,000 10,000,000 15,000,000 20,000,000 25,000,000 30,000,000 35,000,000 40,000,000 1 10 11 12 13 14 15 16 17 18 19 2 20 21 3 4 5 6 7 8 9 Sample ID Filtered Group A B C D

2. Exploring confounders

Consider variables such as age, sex, batch effects, or other biological and technical factors that might impact gene expression. ## Heatmap samples vs samples 4 14 3 19 10 1 8 2 18 11 17 15 7 16 9 5 6 13 21 12 20 4 14 3 19 10 1 8 2 18 11 17 15 7 16 9 5 6 13 21 12 20 0 20 40 60 80

PCA: Principal Component Analysis

pca12 <- nice_PCA(transf.data, PCs = c(1, 2),
                  ntop =  nrow(assay(transf.data)),
                  variables = c("gp", "sex"),
                  legend_names = c("Group", "Sex"),
                  size = 9, alpha = 1, colors = nice_colors,
                  shapes = 21:22, legend_title = 10,
                  legend_elements = 8, legend_pos = NULL,
                  labels = c(var = "num", size = 3))

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 -30 -20 -10 0 10 20 30 -20 0 20 40 PC1: 11% variance PC2: 9% variance Group An Ap Bn Bp Sex F M

tSNE: t-Distributed Stochastic Neighbor Embedding

p.tsne <- nice_tSNE(object = transf.data, seed = seed, perplexity = plex,
                    max_iterations = iter, returnData = FALSE,
                    variables = c("group", "sex"), colors = nice_colors,
                    shapes = 21:22, size = 9, alpha = 1,
                    legend_names = c("Group", "Sex"),
                    legend_title = 10, legend_elements = 8, legend_pos = NULL,
                    labels = c(var = "num", size = 3))
## sigma summary: Min. : 0.592256294214936 |1st Qu. : 0.80258671064056 |Median : 0.883021151140274 |Mean : 0.95360027035352 |3rd Qu. : 1.06449277487991 |Max. : 1.58782782421574 |
## Epoch: Iteration #1000 error is: 0.800904491733047
## Epoch: Iteration #2000 error is: 0.509953545086777
## Epoch: Iteration #3000 error is: 0.506591721778965
## Epoch: Iteration #4000 error is: 0.506573987936383
## Epoch: Iteration #5000 error is: 0.506572323927411
## Epoch: Iteration #6000 error is: 0.50657227929598
## Epoch: Iteration #7000 error is: 0.506572278941316
## Epoch: Iteration #8000 error is: 0.50657227892593
## Epoch: Iteration #9000 error is: 0.50657227890614
## Epoch: Iteration #10000 error is: 0.506572278880658

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 X1 X2 Sex F M Group A/Tc- A/Tc+ B/Tc- B/Tc+

UMAP: Uniform Manifold Approximation and Projection

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 X1 X2 Sex F M Group A/Tc- A/Tc+ B/Tc- B/Tc+

3. Differential Expression Analysis

DESeq2 is a widely used tool for analyzing differential gene expression in RNA-seq data. We performed our analysis adjusting the analysis and improving filtering. ## Create Comparisons

## using pre-existing normalization factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Display results

baseMean log2FoldChange lfcSE stat pvalue padj ensembl symbol description biotype
ENSG00000000003 4.8597885 -1.3065575 0.7629200 -1.7125747 0.0867908 NA ENSG00000000003 TSPAN6 tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858] protein_coding
ENSG00000000005 0.0306208 0.5719487 4.3949775 0.1301369 0.8964581 NA ENSG00000000005 TNMD tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757] protein_coding
ENSG00000000419 183.1677539 -0.0499737 0.1492696 -0.3347882 0.7377848 0.9888945 ENSG00000000419 DPM1 dolichyl-phosphate mannosyltransferase subunit 1, catalytic [Source:HGNC Symbol;Acc:HGNC:3005] protein_coding
ENSG00000000457 239.4758468 0.0334553 0.1156794 0.2892072 0.7724228 0.9921239 ENSG00000000457 SCYL3 SCY1 like pseudokinase 3 [Source:HGNC Symbol;Acc:HGNC:19285] protein_coding
ENSG00000000460 32.1310287 0.1098629 0.2650042 0.4145706 0.6784562 0.9813005 ENSG00000000460 C1orf112 chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:HGNC:25565] protein_coding
ENSG00000000938 4095.0035816 -0.0322014 0.1230765 -0.2616373 0.7936011 0.9927180 ENSG00000000938 FGR FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC Symbol;Acc:HGNC:3697] protein_coding

Split cases

When performing differential expression analysis of a study with 3 phenotypes, including the baseline, there are 10 mutually exclusive cases where genes can fall into.

split_cases <- function (df.BvsA = NULL, df.CvsA = NULL,
                         df.BvsC = NULL, unique_id = "ensembl",
                         significance_var = "padj", significance_cutoff = 0.25,
                         change_var = "log2FoldChange", change_cutoff = 0)

Detectability

This function identifies genes with measurable expression levels across samples.

detect_genes <- function(norm.counts, df.BvsA, df.CvsA = NULL,
                         df.DvsA = NULL, cutoffs = c(50, 50, 0),
                         samples.baseline, samples.condition1,
                         samples.condition2 = NULL, samples.condition3 = NULL)

4. Plotting

Heatmap samples vs genes

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 PLAGL1 XYLT2 BCDIN3D RNF170 SAT2 SYNJ1 SHQ1 CDK16 POLR2J4 DDIT4 TTC39C TXNIP KIF5C RPL13P12 TPRN GCHFR PAM LIG4 PRORP IL7R FRG1BP HBG1 ZFP36L2 TMEM140 CDC14A FAM126A RRM2B SOCS1 Group Group A/Tc- A/Tc+ B/Tc+ -3 -2 -1 0 1 2 3

Volcano plots

x = -1 x = 1 q = 0.25 UNC93B2 EIF3CL ENSG00000257379 C17orf58P RAD51L3-RFFL KTI12-H ENSG00000288643 ENSG00000288684 XYLT2 TPRN HBG1 POLR2J4 1 2 3 4 -10 -6 -2 2 6 10 log 2 (Fold Change) -log 10 FDR Expression Up-regulated Not changed Down-regulated

Box-Scatter-Violin (BSV) plots

**** ** ns 5 6 7 8 9 A/Tc- A/Tc+ B/Tc+ log 2 (Norm. Counts) POLR2J4

5. GO Enrichment

GO enrichment analysis with clusterProfiler helps interpret differential expression results by identifying biological pathways that are overrepresented in differentially expressed genes, providing insights into the underlying biological mechanisms.

Balloon plots

GO terms Networks

6. Prognosis Model

Linear regression models utilize RNA-seq data to identify relationships between gene expression levels and clinical outcomes, enabling the development of prognosis models that predict disease progression or patient survival based on gene expression profiles. 0.0 0.2 0.4 0.6 0.8 1.0 0.0 0.2 0.4 0.6 0.8 1.0 False Positive Rate True Positive Rate Cathegory A/Tc- (AUC = 0.84) A/Tc+ (AUC = 0.77) B/Tc+ (AUC = 0.88) ROC curves for disease progression